Introduction

Within New York City, do the prevalence and the handling of graffiti differ between boroughs? The answer to both questions is yes: the number of reports per capita varies several-fold between boroughs, and an ANOVA and t-tests comparing the mean time to resolve cases show statistically significant differences between boroughs.

Background

The data set contains 21,303 observations, each of which is a reported instance of graffiti. Each observation records the borough and neighborhood where the graffiti was reported, the date the incident was filed, and whether the case is open or closed. If the case is closed, there is also a date for when it was closed. The data was collected from reports of graffiti made to the proper authorities and was published by the Department of Sanitation (DSNY).

One major anomaly in the data is the number of cases that have not been closed. Of the 21,303 observations, 7,669 are still “ongoing”, including cases that were created over a year ago.

Our goal for the rest of the report is to compare the length of time between a case being opened and being closed across boroughs, and to test whether the difference in average case length between boroughs is statistically significant. We also include graphics that illustrate these comparisons and highlight other areas of interest (e.g. graffiti per capita in each borough).

Analysis

First, we tidy and transform the data to fit our analysis by creating a column for the number of days it took to close each case, along with a column indicating whether the case is open or closed. We begin, however, with rates of graffiti in New York regardless of case status: simply the prevalence within each borough.
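As a rough illustration of this tidying step, the sketch below derives both columns on a three-row toy data frame. The column names `created`, `closed`, `length`, and `status` mirror those shown later in the report, but the rows are made up, and base R is used instead of the report's dplyr pipeline so the snippet runs without extra packages. Here `length` is stored as a plain number of days, which is an assumption of this sketch.

```r
# Toy stand-in for the graffiti data (hypothetical rows).
graffiti <- data.frame(
  borough = c("Bronx", "Brooklyn", "Queens"),
  created = as.Date(c("2019-01-07", "2018-12-18", "2019-03-01")),
  closed  = as.Date(c(NA, "2019-03-28", "2019-06-09"))
)

# Days from filing to closure; NA when the case has no closed date.
graffiti$length <- as.numeric(graffiti$closed - graffiti$created)

# Flag each case as open or closed based on whether a closed date exists.
graffiti$status <- ifelse(is.na(graffiti$closed), "Open", "Closed")

print(graffiti)
```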

Image of New York City with Each Point Being a Separate Case of Graffiti

Color Coded by Borough:

Manhattan: Green

The Bronx: Pink

Queens: Red

Brooklyn: Orange

Staten Island: Blue

An image of New York City


Prevalence of Graffiti in NYC

borough cases population case_per_cap pop_per_case
Bronx 3808 2717758 0.0014012 713.6970
Brooklyn 9832 4970026 0.0019783 505.4949
Manhattan 4665 3123068 0.0014937 669.4680
Queens 2696 4460101 0.0006045 1654.3401
Staten Island 299 912458 0.0003277 3051.6990
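The two rate columns follow directly from the case and population counts. A minimal sketch that recomputes them from the figures in the table (the data frame name `prevalence` is only illustrative):

```r
# Case and population counts copied from the table above.
prevalence <- data.frame(
  borough    = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
  cases      = c(3808, 9832, 4665, 2696, 299),
  population = c(2717758, 4970026, 3123068, 4460101, 912458)
)

# Reported cases per resident, and residents per reported case.
prevalence$case_per_cap <- prevalence$cases / prevalence$population
prevalence$pop_per_case <- prevalence$population / prevalence$cases

# Sort from most to fewest reports per resident.
prevalence <- prevalence[order(-prevalence$case_per_cap), ]
print(prevalence)
```

Sorted this way, Brooklyn has the most reports per resident and Staten Island the fewest.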

Addressing “Complete” Data

As mentioned above, 7,669 cases in the data set are not closed; we address those cases later in the report. For now, we look only at observations that have both a created date and a closed date.

borough mean sd total
Bronx 95.97027 35.11295 2119
Brooklyn 103.44818 36.15245 6667
Manhattan 105.43415 33.09512 3045
Queens 107.69930 38.66792 1563
Staten Island 126.26250 54.94030 240

Tests for Statistical Significance of Closed Data

Looking at these bar graphs, boxplot, and density plot alone, it seems that the average number of days to resolve a case is quite similar between the boroughs.

Does the average length of time to close a graffiti case differ across the boroughs? Conducting an ANOVA test:

Alternate Hypothesis: The means are different

Null Hypothesis: The means are the same

A p-value close to 0 would be strong evidence against the null hypothesis and in favor of the alternate hypothesis.

# Per-borough count, mean, and standard deviation of days to close
anova_data = closed_graffiti %>%
  group_by(borough) %>%
  summarize(count = n(),
            mean = as.numeric(mean(length)),
            sd = sd(length))


res.aov <- aov(length ~ borough, data = closed_graffiti)
summary(res.aov)
##                Df   Sum Sq Mean Sq F value Pr(>F)    
## borough         4   283272   70818   54.48 <2e-16 ***
## Residuals   13629 17714759    1300                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(res.aov)[[1]][["Pr(>F)"]][[1]]
## [1] 1.214099e-45

The resulting p-value from the ANOVA test is 1.214099e-45. We therefore reject the null hypothesis: there is very strong evidence that the mean time to close a case is not the same in every borough. Note that ANOVA only tells us that at least one borough mean differs, not which ones.
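As a sanity check on the ANOVA output, the F statistic and its p-value can be reproduced from the sums of squares and degrees of freedom alone. This is a sketch using the rounded values printed in the table, so the p-value will match the reported one only approximately:

```r
# Values copied from the ANOVA table above.
ss_between <- 283272
df_between <- 4
ss_within  <- 17714759
df_within  <- 13629

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-borough to within-borough variation.
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution under the null hypothesis.
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```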

Is there evidence of a statistically significant difference between specific pairs of boroughs, however? We use t-tests to compare.

# Algorithm: Compare two group means, given their summary statistics, using an
# unequal-variance t statistic with the conservative df = min(n1, n2) - 1.
# return: the lower-tail probability P(T <= t) assuming the null hypothesis is true.
# This is a one-sided p-value when mean1 < mean2; double it for a two-sided test.
t_test = function(mean1, sd1, total1, mean2, sd2, total2)
{
  # Standard error of the difference between the two means
  se = sqrt((sd1*sd1 / total1) + (sd2*sd2 / total2))
  t_stat = (mean1 - mean2) / se
  df = min(total1, total2) - 1

  p_val = pt(t_stat, df)
  return (p_val)
}
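Because `pt()` returns a lower-tail probability, the value returned by `t_test` is one-sided. A two-sided variant under the same assumptions (summary statistics only, conservative degrees of freedom) might look like the sketch below. The name `t_test_two_sided` is hypothetical, and the inputs are the Brooklyn and Manhattan rows of the summary table:

```r
# Two-sided p-value for a difference in means, from summary statistics.
t_test_two_sided <- function(mean1, sd1, n1, mean2, sd2, n2) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)   # standard error of the difference
  t_stat <- (mean1 - mean2) / se
  df <- min(n1, n2) - 1                 # conservative degrees of freedom
  # Probability of a statistic at least this extreme in either direction.
  2 * pt(-abs(t_stat), df)
}

# Brooklyn vs Manhattan, using the means, sds, and totals reported above.
p_bk_mn <- t_test_two_sided(103.44818, 36.15245, 6667,
                            105.43415, 33.09512, 3045)
print(p_bk_mn)
```

Taking the absolute value of the statistic makes the result independent of the order in which the two boroughs are passed.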

Does the average length of time to close a graffiti case differ between Manhattan and Brooklyn?

Conducting a t-test results in the following:

# Ho: mu Manhattan = mu Brooklyn
# Ha: mu Manhattan =/= mu Brooklyn
t_test(Brooklyn$mean, Brooklyn$sd, Brooklyn$total,
       Manhattan$mean, Manhattan$sd, Manhattan$total)
## [1] 0.003880954

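If the null hypothesis were true, a test statistic this far below zero would occur about 0.388% of the time. That is a one-tailed probability; doubling it for the two-sided alternative gives about 0.78%, still well under the conventional 0.05 threshold. Thus, the difference in the mean number of days to close a graffiti case between Brooklyn and Manhattan is statistically significant.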

Does the average length of time to close a graffiti case differ between Queens and the Bronx? Conducting a t-test results in the following:

# Ho: mu Queens = mu Bronx
# Ha: mu Queens =/= mu Bronx
t_test(Bronx$mean, Bronx$sd, Bronx$total,
       Queens$mean, Queens$sd, Queens$total)
## [1] 5.631018e-21

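If the null hypothesis were true, a difference this large would essentially never occur: the returned probability is about 5.6e-21, and it remains effectively zero when doubled for the two-sided alternative. Thus, there is very strong evidence that the mean number of days to close a graffiti case differs between Queens and the Bronx.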

Addressing Missing Data

As noted in the Background section, 36% of the observations are not “closed”. For many, this might be because the city has not yet had the opportunity to clean up the graffiti, but for some it might be due to administrative error (the case is closed but was not marked as such) or because the city forgot about the case.

To address this we created a new data frame, ‘open_graffiti’, containing only the 36% of observations that are not closed. We used mutate to add a column, ‘since_open’, holding the number of days between the date the case was created and the most recent date in the data (Oct 14 2019). For some cases this value is 0 because the case was opened the day the data was published; for others it exceeds a year, with everything in between.
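The ‘since_open’ calculation amounts to a date subtraction. A minimal base R sketch, using three created dates from the open cases in this report (the report itself used mutate; the variable name `latest_date` is an assumption):

```r
# Most recent date in the data, as stated above.
latest_date <- as.Date("2019-10-14")

# Three open cases, identified only by their created dates.
open_graffiti <- data.frame(
  created = as.Date(c("2019-03-07", "2019-02-19", "2019-01-17"))
)

# Days each case has been open as of the latest date.
open_graffiti$since_open <- as.numeric(latest_date - open_graffiti$created)

print(open_graffiti)
```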

Top Six Observations on the open_graffiti Dataset

address borough created closed status long lat length since_open
114 west 14th street Bronx 2019-03-07 NA Open NA NA NA 221 days
101 E 163 street Bronx 2019-02-19 NA Open NA NA NA 237 days
121N CHRYSTIE STREET Manhattan 2019-01-17 NA Open -73.99347 40.71886 NA 270 days
792 ST NERI WAY Bronx 2019-01-07 NA Open NA NA NA 280 days
146 rockaway ave Brooklyn 2018-12-18 NA Open -73.91078 40.67811 NA 300 days
597 39 STREET Brooklyn 2018-11-28 NA Open -74.00207 40.65016 NA 320 days

How do we separate cases that are still being worked on from cases where other factors might be at play?

Here we take the 95th percentile of ‘length’ from the data frame ‘closed_graffiti’. Our logic is that if ‘since_open’ for an open case exceeds the 95th percentile of ‘length’ among closed cases, one or more of the following may apply.

  1. There was an administrative error and a case actually is closed but was not marked as such.
  2. The case was simply forgotten about or ignored by the Department of Sanitation.
  3. Another factor we have failed to consider.
## Time difference of 154 days

Of all closed observations of graffiti in NYC, 95% were closed in 154 days or less. Next, for each borough, we divide the number of open cases older than 154 days by the total number of open cases in that borough. This gives five proportions, one per borough, of ‘issue’ cases to total open cases. The larger the proportion, the greater the share of that borough's open cases that are also ‘very old’.

borough total_issue total_open prop
Bronx 463 1689 0.2741267
Brooklyn 904 3165 0.2856240
Manhattan 514 1620 0.3172840
Queens 442 1133 0.3901147
Staten Island 21 59 0.3559322
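The cutoff-and-proportion step described above can be sketched as follows. Both vectors are made-up toy data, so the cutoff here is not the report's 154 days, and the names `closed_length`, `open_cases`, and `issue` are illustrative assumptions:

```r
# Hypothetical days-to-close for a handful of closed cases.
closed_length <- c(50, 90, 100, 110, 154)

# 95th percentile of closure times; open cases older than this are flagged.
cutoff <- unname(quantile(closed_length, probs = 0.95))

# Hypothetical open cases with the number of days each has been open.
open_cases <- data.frame(
  borough    = c("Bronx", "Bronx", "Queens", "Queens", "Queens"),
  since_open = c(221, 30, 300, 160, 12)
)

# TRUE when an open case is older than the cutoff.
open_cases$issue <- open_cases$since_open > cutoff

# The mean of a logical vector is the proportion of TRUE values, per borough.
prop_issue <- tapply(open_cases$issue, open_cases$borough, mean)
print(prop_issue)
```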

This graph displays those proportions. In addition, the size of each dot corresponds to the mean number of days to close a case in that borough. The lower the dot, the smaller the share of ‘issue’ cases that borough has. The smaller the dot, the faster closed cases were closed. Put simply, a small, low dot can be interpreted as “good” and a large, high dot as “bad”.

Maps

Discussion

We can see from the ANOVA test and t-tests that there is indeed a statistically significant difference between boroughs in the average number of days for a graffiti case to be resolved. To our surprise, the borough with the smallest mean, and therefore the fastest case closure, is the Bronx. We expected it to be Manhattan, given that tourists frequent Manhattan more than any other borough and we believed the city would want tourists to see as little graffiti as possible. Furthermore, the per-capita figures show that Staten Island, which has the smallest population of the five boroughs, also has the fewest graffiti reports per person, while Brooklyn has the most. An interesting route of further inquiry might compare the prevalence of another type of crime, e.g. burglary, with the prevalence of graffiti and ask questions such as “is there an association between the prevalence of graffiti and other crimes, and do those other crimes follow the same patterns by borough as graffiti?”.